In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer # Perform statistical tests before PCA 
import warnings
warnings.filterwarnings("ignore")
import os
import statsmodels.api as sm
from statsmodels.formula.api import ols      # For n-way ANOVA
from statsmodels.stats.anova import anova_lm  # For n-way ANOVA
In [2]:
os.chdir('C:\\Users\\tmaji\\Downloads')
os.getcwd()
Out[2]:
'C:\\Users\\tmaji\\Downloads'
In [3]:
df=pd.read_csv('SalaryData.csv')
df.head()
Out[3]:
Education Occupation Salary
0 Doctorate Adm-clerical 153197
1 Doctorate Adm-clerical 115945
2 Doctorate Adm-clerical 175935
3 Doctorate Adm-clerical 220754
4 Doctorate Sales 170769
In [4]:
df.describe()
Out[4]:
Salary
count 40.000000
mean 162186.875000
std 64860.407506
min 50103.000000
25% 99897.500000
50% 169100.000000
75% 214440.750000
max 260151.000000
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Education   40 non-null     object
 1   Occupation  40 non-null     object
 2   Salary      40 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ KB
In [6]:
df.Education = pd.Categorical(df.Education)
In [7]:
df['Education'].value_counts()
Out[7]:
 Doctorate    16
 Bachelors    15
 HS-grad       9
Name: Education, dtype: int64
In [8]:
df.Occupation = pd.Categorical(df.Occupation)
In [9]:
df['Occupation'].value_counts()
Out[9]:
 Prof-specialty     13
 Sales              12
 Adm-clerical       10
 Exec-managerial     5
Name: Occupation, dtype: int64
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Education   40 non-null     category
 1   Occupation  40 non-null     category
 2   Salary      40 non-null     int64   
dtypes: category(2), int64(1)
memory usage: 824.0 bytes
In [11]:
df.groupby('Education').mean()
Out[11]:
Salary
Education
Bachelors 165152.933333
Doctorate 208427.000000
HS-grad 75038.777778
In [12]:
df.groupby('Occupation').mean()
Out[12]:
Salary
Occupation
Adm-clerical 141424.300000
Exec-managerial 197117.600000
Prof-specialty 168953.153846
Sales 157604.416667
In [13]:
df.describe()
Out[13]:
Salary
count 40.000000
mean 162186.875000
std 64860.407506
min 50103.000000
25% 99897.500000
50% 169100.000000
75% 214440.750000
max 260151.000000

1.1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.

Null hypothesis, H0: the mean Salary is the same across all levels of the factor (all Education levels, or all Occupation levels), i.e. the factor has no effect on Salary. Alternate hypothesis, H1: at least one level's mean Salary differs from the others, i.e. the factor does affect Salary. These hypotheses are stated separately for Education and for Occupation.

In [14]:
model = ols('Salary ~ Occupation', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
Out[14]:
sum_sq df F PR(>F)
Occupation 1.125878e+10 3.0 0.884144 0.458508
Residual 1.528092e+11 36.0 NaN NaN

The p-value (0.458508) is greater than the significance level α = 0.05.

We therefore fail to reject the null hypothesis and conclude that Occupation has no statistically significant effect on Salary.

In [15]:
model = ols('Salary ~ Education', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
Out[15]:
sum_sq df F PR(>F)
Education 1.026955e+11 2.0 30.95628 1.257709e-08
Residual 6.137256e+10 37.0 NaN NaN

The p-value (1.257709e-08) is smaller than the significance level α = 0.05.

We therefore reject the null hypothesis and conclude that Education has a statistically significant effect on Salary.

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

In [16]:
model = ols('Salary ~ Education', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
Out[16]:
sum_sq df F PR(>F)
Education 1.026955e+11 2.0 30.95628 1.257709e-08
Residual 6.137256e+10 37.0 NaN NaN

The p-value (1.257709e-08) is smaller than the significance level α = 0.05.

We therefore reject the null hypothesis and conclude that Education has a statistically significant effect on Salary.
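As a cross-check, a one-way ANOVA can also be run with `scipy.stats.f_oneway`, which takes the raw samples per group. The sketch below uses synthetic salary samples whose group sizes and rough means mirror the `value_counts()` and `groupby()` output above; the numbers are illustrative stand-ins, not the actual dataset.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Synthetic salaries per education level (sizes/means loosely mirror the data)
hs_grad   = rng.normal(75_000,  15_000, 9)
bachelors = rng.normal(165_000, 40_000, 15)
doctorate = rng.normal(208_000, 40_000, 16)

# f_oneway returns the F statistic and the p-value, matching anova_lm's columns
f_stat, p_value = f_oneway(hs_grad, bachelors, doctorate)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

With group means this far apart relative to their spread, the p-value comes out far below 0.05, matching the conclusion drawn from `anova_lm` above.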

In [ ]:
 

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

In [17]:
model = ols('Salary ~ Occupation', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
Out[17]:
sum_sq df F PR(>F)
Occupation 1.125878e+10 3.0 0.884144 0.458508
Residual 1.528092e+11 36.0 NaN NaN

The p-value (0.458508) is greater than the significance level α = 0.05.

We therefore fail to reject the null hypothesis and conclude that Occupation has no statistically significant effect on Salary.

In [ ]:
 

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result.

In [18]:
#Education: the null hypothesis was rejected, so at least one pair of education-level
#means differs significantly. From the groupby() means above, HS-grad (~75k) sits far
#below Bachelors (~165k) and Doctorate (~208k); a post-hoc test such as Tukey's HSD
#identifies which pairs differ.

#Occupation: we failed to reject the null hypothesis, so no occupation class means
#differ significantly.
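The pairwise comparison asked for in 1.4 is normally done with Tukey's HSD test. A minimal sketch with `statsmodels.stats.multicomp.pairwise_tukeyhsd` is shown below on synthetic data whose group sizes and rough means mirror the `groupby()` output above (an illustrative stand-in, not the actual SalaryData values).

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
# Synthetic salaries; sizes/means loosely mirror the Education groups above
salary = np.concatenate([
    rng.normal(75_000,  15_000, 9),    # HS-grad
    rng.normal(165_000, 30_000, 15),   # Bachelors
    rng.normal(208_000, 30_000, 16),   # Doctorate
])
group = np.array(["HS-grad"] * 9 + ["Bachelors"] * 15 + ["Doctorate"] * 16)

# pairwise_tukeyhsd tests every pair of group means at the given alpha
tukey = pairwise_tukeyhsd(endog=salary, groups=group, alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary marks which pairs of class means differ significantly; with gaps this large, HS-grad differs from both other education levels.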

1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.

In [19]:
sns.pointplot(x = 'Education', y = 'Salary',hue='Occupation',data=df)
plt.grid()
plt.show()
In [20]:
sns.pointplot(x = 'Occupation', y = 'Salary',hue='Education',data=df)
plt.grid()
plt.show()

As seen from the two interaction plots above, the lines cross rather than run parallel, which suggests a noticeable interaction between the two categorical variables.

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and state your results. How will you interpret this result?

In [21]:
#perform two-way ANOVA
model = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
Out[21]:
sum_sq df F PR(>F)
C(Education) 1.799186e+11 2.0 126.512640 4.293626e-12
C(Occupation) 3.373139e+09 3.0 1.581251 2.229454e-01
C(Education):C(Occupation) 4.227791e+10 6.0 9.909463 1.323371e-05
Residual 2.062102e+10 29.0 NaN NaN

Null hypotheses: Education has no effect on Salary; Occupation has no effect on Salary; there is no Education×Occupation interaction effect. The alternative hypotheses state that the corresponding effect exists.

The p-value for the Education:Occupation interaction (1.323371e-05) is less than 0.05, so the interaction has a statistically significant effect on Salary.

The p-value for Education (4.293626e-12) is less than 0.05, so Education has a statistically significant effect on Salary.

The p-value for Occupation (2.229454e-01) is greater than 0.05, so Occupation on its own has no statistically significant effect on Salary.

1.7 Explain the business implications of performing ANOVA for this particular case study.

Education is the dominant driver of Salary in this data: Doctorate holders average about 208k against 165k for Bachelors and 75k for HS-grads, so education level should carry the most weight in salary benchmarking and compensation planning.

Occupation by itself does not significantly affect Salary, so pay bands need not differ by occupation alone.

However, the significant Education×Occupation interaction (p = 1.323371e-05) means the salary premium of a given occupation depends on the employee's education level; compensation decisions should therefore consider the combination of the two factors rather than either one in isolation.

In [ ]:
 

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?

In [22]:
data = pd.read_csv('Education+-+Post+12th+Standard.csv')
In [23]:
data.head()
Out[23]:
Names Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 Adelphi University 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 Adrian College 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 Agnes Scott College 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 Alaska Pacific University 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15
In [24]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Names        777 non-null    object 
 1   Apps         777 non-null    int64  
 2   Accept       777 non-null    int64  
 3   Enroll       777 non-null    int64  
 4   Top10perc    777 non-null    int64  
 5   Top25perc    777 non-null    int64  
 6   F.Undergrad  777 non-null    int64  
 7   P.Undergrad  777 non-null    int64  
 8   Outstate     777 non-null    int64  
 9   Room.Board   777 non-null    int64  
 10  Books        777 non-null    int64  
 11  Personal     777 non-null    int64  
 12  PhD          777 non-null    int64  
 13  Terminal     777 non-null    int64  
 14  S.F.Ratio    777 non-null    float64
 15  perc.alumni  777 non-null    int64  
 16  Expend       777 non-null    int64  
 17  Grad.Rate    777 non-null    int64  
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
In [25]:
data.describe()
Out[25]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.00000
mean 3001.638353 2018.804376 779.972973 27.558559 55.796654 3699.907336 855.298584 10440.669241 4357.526384 549.380952 1340.642214 72.660232 79.702703 14.089704 22.743887 9660.171171 65.46332
std 3870.201484 2451.113971 929.176190 17.640364 19.804778 4850.420531 1522.431887 4023.016484 1096.696416 165.105360 677.071454 16.328155 14.722359 3.958349 12.391801 5221.768440 17.17771
min 81.000000 72.000000 35.000000 1.000000 9.000000 139.000000 1.000000 2340.000000 1780.000000 96.000000 250.000000 8.000000 24.000000 2.500000 0.000000 3186.000000 10.00000
25% 776.000000 604.000000 242.000000 15.000000 41.000000 992.000000 95.000000 7320.000000 3597.000000 470.000000 850.000000 62.000000 71.000000 11.500000 13.000000 6751.000000 53.00000
50% 1558.000000 1110.000000 434.000000 23.000000 54.000000 1707.000000 353.000000 9990.000000 4200.000000 500.000000 1200.000000 75.000000 82.000000 13.600000 21.000000 8377.000000 65.00000
75% 3624.000000 2424.000000 902.000000 35.000000 69.000000 4005.000000 967.000000 12925.000000 5050.000000 600.000000 1700.000000 85.000000 92.000000 16.500000 31.000000 10830.000000 78.00000
max 48094.000000 26330.000000 6392.000000 96.000000 100.000000 31643.000000 21836.000000 21700.000000 8124.000000 2340.000000 6800.000000 103.000000 100.000000 39.800000 64.000000 56233.000000 118.00000

Univariate Analysis

In [26]:
plt.figure(figsize=(20,10))
sns.boxplot(data=data)
plt.grid()
plt.show()
In [27]:
sns.distplot(data['Enroll']);

From the figure above, we can see that the Enroll variable is right-skewed.

In [28]:
plt.figure(figsize=(12,8))
plt.subplot(1,4,1)
sns.distplot(data['Apps'])

plt.subplot(1,4,2)
sns.distplot(data['F.Undergrad'])

plt.subplot(1,4,3)
sns.distplot(data['Grad.Rate'])

plt.subplot(1,4,4)
sns.distplot(data['PhD'])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x2531c8c2e20>
In [29]:
#From the above figures we can conclude that "Apps" (applications received) and
#"F.Undergrad" (full-time undergrads) are right-skewed,
#"PhD" is left-skewed, and
#"Grad.Rate" (graduation rate) is approximately normally distributed
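These visual skew judgments can be backed up numerically with `scipy.stats.skew` (or `data['Apps'].skew()` on the actual columns). A small sketch on synthetic data: a lognormal sample stands in for a right-skewed column like Apps, a normal sample for a roughly symmetric one like Grad.Rate.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # long right tail
symmetric    = rng.normal(loc=0.0, scale=1.0, size=1000)      # roughly symmetric

# Positive skew = right tail; near zero = approximately symmetric
print(f"lognormal skew: {skew(right_skewed):.2f}")
print(f"normal skew:    {skew(symmetric):.2f}")
```

A skew well above zero confirms "right-skewed"; a value near zero supports "approximately normal".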

Multivariate Analysis

In [30]:
#Pairplot of all variables
sns.pairplot(data)
Out[30]:
<seaborn.axisgrid.PairGrid at 0x2531c9cd940>
In [31]:
cor=data.corr()
cor
Out[31]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Apps 1.000000 0.943451 0.846822 0.338834 0.351640 0.814491 0.398264 0.050159 0.164939 0.132559 0.178731 0.390697 0.369491 0.095633 -0.090226 0.259592 0.146755
Accept 0.943451 1.000000 0.911637 0.192447 0.247476 0.874223 0.441271 -0.025755 0.090899 0.113525 0.200989 0.355758 0.337583 0.176229 -0.159990 0.124717 0.067313
Enroll 0.846822 0.911637 1.000000 0.181294 0.226745 0.964640 0.513069 -0.155477 -0.040232 0.112711 0.280929 0.331469 0.308274 0.237271 -0.180794 0.064169 -0.022341
Top10perc 0.338834 0.192447 0.181294 1.000000 0.891995 0.141289 -0.105356 0.562331 0.371480 0.118858 -0.093316 0.531828 0.491135 -0.384875 0.455485 0.660913 0.494989
Top25perc 0.351640 0.247476 0.226745 0.891995 1.000000 0.199445 -0.053577 0.489394 0.331490 0.115527 -0.080810 0.545862 0.524749 -0.294629 0.417864 0.527447 0.477281
F.Undergrad 0.814491 0.874223 0.964640 0.141289 0.199445 1.000000 0.570512 -0.215742 -0.068890 0.115550 0.317200 0.318337 0.300019 0.279703 -0.229462 0.018652 -0.078773
P.Undergrad 0.398264 0.441271 0.513069 -0.105356 -0.053577 0.570512 1.000000 -0.253512 -0.061326 0.081200 0.319882 0.149114 0.141904 0.232531 -0.280792 -0.083568 -0.257001
Outstate 0.050159 -0.025755 -0.155477 0.562331 0.489394 -0.215742 -0.253512 1.000000 0.654256 0.038855 -0.299087 0.382982 0.407983 -0.554821 0.566262 0.672779 0.571290
Room.Board 0.164939 0.090899 -0.040232 0.371480 0.331490 -0.068890 -0.061326 0.654256 1.000000 0.127963 -0.199428 0.329202 0.374540 -0.362628 0.272363 0.501739 0.424942
Books 0.132559 0.113525 0.112711 0.118858 0.115527 0.115550 0.081200 0.038855 0.127963 1.000000 0.179295 0.026906 0.099955 -0.031929 -0.040208 0.112409 0.001061
Personal 0.178731 0.200989 0.280929 -0.093316 -0.080810 0.317200 0.319882 -0.299087 -0.199428 0.179295 1.000000 -0.010936 -0.030613 0.136345 -0.285968 -0.097892 -0.269344
PhD 0.390697 0.355758 0.331469 0.531828 0.545862 0.318337 0.149114 0.382982 0.329202 0.026906 -0.010936 1.000000 0.849587 -0.130530 0.249009 0.432762 0.305038
Terminal 0.369491 0.337583 0.308274 0.491135 0.524749 0.300019 0.141904 0.407983 0.374540 0.099955 -0.030613 0.849587 1.000000 -0.160104 0.267130 0.438799 0.289527
S.F.Ratio 0.095633 0.176229 0.237271 -0.384875 -0.294629 0.279703 0.232531 -0.554821 -0.362628 -0.031929 0.136345 -0.130530 -0.160104 1.000000 -0.402929 -0.583832 -0.306710
perc.alumni -0.090226 -0.159990 -0.180794 0.455485 0.417864 -0.229462 -0.280792 0.566262 0.272363 -0.040208 -0.285968 0.249009 0.267130 -0.402929 1.000000 0.417712 0.490898
Expend 0.259592 0.124717 0.064169 0.660913 0.527447 0.018652 -0.083568 0.672779 0.501739 0.112409 -0.097892 0.432762 0.438799 -0.583832 0.417712 1.000000 0.390343
Grad.Rate 0.146755 0.067313 -0.022341 0.494989 0.477281 -0.078773 -0.257001 0.571290 0.424942 0.001061 -0.269344 0.305038 0.289527 -0.306710 0.490898 0.390343 1.000000
In [32]:
plt.figure(figsize=(12,12))
sns.heatmap(cor, annot=True)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x25325efb5e0>

In the pair plot above, scatter diagrams are drawn for every pair of numerical columns in the dataset. Each scatter plot gives a visual sense of the degree of correlation between two columns, and seaborn's pairplot function makes it easy to generate these joint scatter plots for all columns at once.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

Yes, scaling is necessary. The variables of a data set are often on very different scales, e.g. one variable in the tens of thousands and another a two-digit percentage.

This data set mixes three kinds of values: 1) counts/amounts (4-5 digits), 2) a ratio (2 digits with a decimal), and 3) percentages (2 digits). Because these variables are on different scales, they are hard to compare directly.

Counts/amounts: Apps, Accept, Enroll, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, Expend. Percentages: Top10perc, Top25perc, PhD, Terminal, perc.alumni, Grad.Rate. Ratio: S.F.Ratio.

Feature scaling (also known as data normalization) standardizes the range of the features of the data. Since the ranges of the raw values vary widely, scaling is a necessary preprocessing step for many machine learning algorithms, and especially for PCA, which would otherwise be dominated by the largest-scale variables.

In this method, we convert variables measured on different scales into a single common scale.

StandardScaler standardizes the data using the formula (x − mean) / standard deviation.
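A minimal sketch of that formula, applied to the first five Apps values from `head()` above. Note that `zscore` and `StandardScaler` both divide by the population standard deviation (ddof=0), which is why the `describe()` output of the scaled data later shows a std of 1.000644 = sqrt(777/776) rather than exactly 1 (pandas `describe` reports the sample std, ddof=1).

```python
import numpy as np

# First five 'Apps' values from head() above
x = np.array([1660.0, 2186.0, 1428.0, 417.0, 193.0])

# (x - mean) / std: numpy's std() defaults to ddof=0, like StandardScaler
z = (x - x.mean()) / x.std()
print(z.round(3))
```

After standardization the values have mean 0 and (population) standard deviation 1, so every column ends up on the same scale regardless of its original units.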

In [ ]:
 
In [33]:
new_df=data.copy()
new_df
## Dropping the Names feature before scaling, since an identifier adds no value in model building

new_df.drop(labels='Names',axis=1,inplace=True)
new_df.head()
Out[33]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

2.3 Comment on the comparison between the covariance and the correlation matrices from this data.

Covariance and correlation are two closely related statistical concepts: both measure the relationship and dependency between two random variables. Covariance measures how two variables vary together, while correlation describes how strongly a change in one variable is associated with a change in the other.

Covariance indicates the extent to which two random variables change in tandem; correlation represents how strongly they are related on a standardized scale. Correlation is the scaled form of covariance: its value lies between −1 and +1, whereas covariance can take any value between −∞ and +∞. Covariance is affected by a change of scale: if all values of one variable are multiplied by a constant, the covariance changes accordingly. Correlation, by contrast, is unaffected by changes of scale and is dimensionless (unit-free), whereas covariance carries the product of the two variables' units.

In [34]:
#covariance
cov_matrix = np.cov(new_df.T)
cov_matrix
Out[34]:
array([[ 1.49784595e+07,  8.94985981e+06,  3.04525599e+06,
         2.31327731e+04,  2.69526635e+04,  1.52897025e+07,
         2.34662015e+06,  7.80970356e+05,  7.00072872e+05,
         8.47037526e+04,  4.68346833e+05,  2.46894337e+04,
         2.10530676e+04,  1.46506058e+03, -4.32712238e+03,
         5.24617110e+06,  9.75642164e+03],
       [ 8.94985981e+06,  6.00795970e+06,  2.07626776e+06,
         8.32112487e+03,  1.20134048e+04,  1.03935824e+07,
         1.64666972e+06, -2.53962285e+05,  2.44347147e+05,
         4.59428079e+04,  3.33556631e+05,  1.42382015e+04,
         1.21820938e+04,  1.70983819e+03, -4.85948702e+03,
         1.59627169e+06,  2.83416292e+03],
       [ 3.04525599e+06,  2.07626776e+06,  8.63368392e+05,
         2.97158341e+03,  4.17259244e+03,  4.34752988e+06,
         7.25790674e+05, -5.81188483e+05, -4.09970592e+04,
         1.72911997e+04,  1.76737970e+05,  5.02896117e+03,
         4.21708603e+03,  8.72684773e+02, -2.08169379e+03,
         3.11345431e+05, -3.56587977e+02],
       [ 2.31327731e+04,  8.32112487e+03,  2.97158341e+03,
         3.11182456e+02,  3.11630480e+02,  1.20891137e+04,
        -2.82947498e+03,  3.99071798e+04,  7.18670561e+03,
         3.46177405e+02, -1.11455119e+03,  1.53184870e+02,
         1.27551581e+02, -2.68745252e+01,  9.95672077e+01,
         6.08793102e+04,  1.49992164e+02],
       [ 2.69526635e+04,  1.20134048e+04,  4.17259244e+03,
         3.11630480e+02,  3.92229216e+02,  1.91589528e+04,
        -1.61541214e+03,  3.89924275e+04,  7.19990357e+03,
         3.77759266e+02, -1.08360506e+03,  1.76518449e+02,
         1.53002612e+02, -2.30971994e+01,  1.02550946e+02,
         5.45464833e+04,  1.62371398e+02],
       [ 1.52897025e+07,  1.03935824e+07,  4.34752988e+06,
         1.20891137e+04,  1.91589528e+04,  2.35265793e+07,
         4.21291009e+06, -4.20984304e+06, -3.66458224e+05,
         9.25357647e+04,  1.04170909e+06,  2.52117842e+04,
         2.14242417e+04,  5.37020858e+03, -1.37919297e+04,
         4.72403958e+05, -6.56330753e+03],
       [ 2.34662015e+06,  1.64666972e+06,  7.25790674e+05,
        -2.82947498e+03, -1.61541214e+03,  4.21291009e+06,
         2.31779885e+06, -1.55270428e+06, -1.02391862e+05,
         2.04104467e+04,  3.29732427e+05,  3.70675622e+03,
         3.18059661e+03,  1.40130256e+03, -5.29733709e+03,
        -6.64351154e+05, -6.72106249e+03],
       [ 7.80970356e+05, -2.53962285e+05, -5.81188483e+05,
         3.99071798e+04,  3.89924275e+04, -4.20984304e+06,
        -1.55270428e+06,  1.61846616e+07,  2.88659739e+06,
         2.58082421e+04, -8.14673718e+05,  2.51575151e+04,
         2.41641477e+04, -8.83525354e+03,  2.82295531e+04,
         1.41332357e+07,  3.94796818e+04],
       [ 7.00072872e+05,  2.44347147e+05, -4.09970592e+04,
         7.18670561e+03,  7.19990357e+03, -3.66458224e+05,
        -1.02391862e+05,  2.88659739e+06,  1.20274303e+06,
         2.31703134e+04, -1.48083768e+05,  5.89503475e+03,
         6.04729974e+03, -1.57420591e+03,  3.70143138e+03,
         2.87330848e+06,  8.00536018e+03],
       [ 8.47037526e+04,  4.59428079e+04,  1.72911997e+04,
         3.46177405e+02,  3.77759266e+02,  9.25357647e+04,
         2.04104467e+04,  2.58082421e+04,  2.31703134e+04,
         2.72597799e+04,  2.00430257e+04,  7.25342415e+01,
         2.42963918e+02, -2.08672067e+01, -8.22631321e+01,
         9.69125803e+04,  3.00883652e+00],
       [ 4.68346833e+05,  3.33556631e+05,  1.76737970e+05,
        -1.11455119e+03, -1.08360506e+03,  1.04170909e+06,
         3.29732427e+05, -8.14673718e+05, -1.48083768e+05,
         2.00430257e+04,  4.58425753e+05, -1.20898783e+02,
        -3.05154186e+02,  3.65415770e+02, -2.39931082e+03,
        -3.46097802e+05, -3.13261494e+03],
       [ 2.46894337e+04,  1.42382015e+04,  5.02896117e+03,
         1.53184870e+02,  1.76518449e+02,  2.52117842e+04,
         3.70675622e+03,  2.51575151e+04,  5.89503475e+03,
         7.25342415e+01, -1.20898783e+02,  2.66608636e+02,
         2.04231332e+02, -8.43649246e+00,  5.03832295e+01,
         3.68980582e+04,  8.55571090e+01],
       [ 2.10530676e+04,  1.21820938e+04,  4.21708603e+03,
         1.27551581e+02,  1.53002612e+02,  2.14242417e+04,
         3.18059661e+03,  2.41641477e+04,  6.04729974e+03,
         2.42963918e+02, -3.05154186e+02,  2.04231332e+02,
         2.16747841e+02, -9.33025564e+00,  4.87343271e+01,
         3.37334569e+04,  7.32203957e+01],
       [ 1.46506058e+03,  1.70983819e+03,  8.72684773e+02,
        -2.68745252e+01, -2.30971994e+01,  5.37020858e+03,
         1.40130256e+03, -8.83525354e+03, -1.57420591e+03,
        -2.08672067e+01,  3.65415770e+02, -8.43649246e+00,
        -9.33025564e+00,  1.56685279e+01, -1.97641094e+01,
        -1.20675646e+04, -2.08548884e+01],
       [-4.32712238e+03, -4.85948702e+03, -2.08169379e+03,
         9.95672077e+01,  1.02550946e+02, -1.37919297e+04,
        -5.29733709e+03,  2.82295531e+04,  3.70143138e+03,
        -8.22631321e+01, -2.39931082e+03,  5.03832295e+01,
         4.87343271e+01, -1.97641094e+01,  1.53556744e+02,
         2.70289215e+04,  1.04493815e+02],
       [ 5.24617110e+06,  1.59627169e+06,  3.11345431e+05,
         6.08793102e+04,  5.45464833e+04,  4.72403958e+05,
        -6.64351154e+05,  1.41332357e+07,  2.87330848e+06,
         9.69125803e+04, -3.46097802e+05,  3.68980582e+04,
         3.37334569e+04, -1.20675646e+04,  2.70289215e+04,
         2.72668656e+07,  3.50129683e+04],
       [ 9.75642164e+03,  2.83416292e+03, -3.56587977e+02,
         1.49992164e+02,  1.62371398e+02, -6.56330753e+03,
        -6.72106249e+03,  3.94796818e+04,  8.00536018e+03,
         3.00883652e+00, -3.13261494e+03,  8.55571090e+01,
         7.32203957e+01, -2.08548884e+01,  1.04493815e+02,
         3.50129683e+04,  2.95073717e+02]])

Covariance and correlation are closely related, yet they differ in important ways.

When choosing between the two, correlation is usually preferred because it is unaffected by changes in location and scale and can be compared across pairs of variables. Since it is bounded between −1 and +1, it allows comparisons between variables across domains. An important limitation, however, is that both measures capture only linear relationships.
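The scale sensitivity described above is easy to demonstrate. In the sketch below (synthetic data), rescaling one variable by 1000, e.g. converting dollars to thousandths, multiplies the covariance by 1000 but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # y is linearly related to x, plus noise

cov_xy = np.cov(x, y)[0, 1]          # off-diagonal entry = covariance
cor_xy = np.corrcoef(x, y)[0, 1]     # off-diagonal entry = correlation

# Rescale x by a factor of 1000 (a pure change of units)
cov_scaled = np.cov(x * 1000, y)[0, 1]
cor_scaled = np.corrcoef(x * 1000, y)[0, 1]

print(cov_xy, cov_scaled)  # covariance scales with the factor 1000
print(cor_xy, cor_scaled)  # correlation is unchanged
```

This is exactly why the correlation matrix (all entries in [−1, 1]) is readable at a glance, while the covariance matrix above mixes magnitudes from 1e+01 to 1e+07 depending on each column's units.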

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

In [35]:
new_df=data.copy()
new_df
## Dropping the Names feature before scaling, since an identifier adds no value in model building

new_df.drop(labels='Names',axis=1,inplace=True)
new_df.head()
Out[35]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15
In [36]:
cat=[]
num=[]
for i in new_df.columns:
    if new_df[i].dtype=="object":
        cat.append(i)
    else:
        num.append(i)
print(cat) 
print(num)
[]
['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend', 'Grad.Rate']
In [37]:
# Method 1
## Using Zscore for scaling/standardisation
from scipy.stats import zscore
data_scaled=new_df[num].apply(zscore)
data_scaled.head()
Out[37]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 -0.346882 -0.321205 -0.063509 -0.258583 -0.191827 -0.168116 -0.209207 -0.746356 -0.964905 -0.602312 1.270045 -0.163028 -0.115729 1.013776 -0.867574 -0.501910 -0.318252
1 -0.210884 -0.038703 -0.288584 -0.655656 -1.353911 -0.209788 0.244307 0.457496 1.909208 1.215880 0.235515 -2.675646 -3.378176 -0.477704 -0.544572 0.166110 -0.551262
2 -0.406866 -0.376318 -0.478121 -0.315307 -0.292878 -0.549565 -0.497090 0.201305 -0.554317 -0.905344 -0.259582 -1.204845 -0.931341 -0.300749 0.585935 -0.177290 -0.667767
3 -0.668261 -0.681682 -0.692427 1.840231 1.677612 -0.658079 -0.520752 0.626633 0.996791 -0.602312 -0.688173 1.185206 1.175657 -1.615274 1.151188 1.792851 -0.376504
4 -0.726176 -0.764555 -0.780735 -0.655656 -0.596031 -0.711924 0.009005 -0.716508 -0.216723 1.518912 0.235515 0.204672 -0.523535 -0.553542 -1.675079 0.241803 -2.939613
In [38]:
new_df.describe()
Out[38]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.00000
mean 3001.638353 2018.804376 779.972973 27.558559 55.796654 3699.907336 855.298584 10440.669241 4357.526384 549.380952 1340.642214 72.660232 79.702703 14.089704 22.743887 9660.171171 65.46332
std 3870.201484 2451.113971 929.176190 17.640364 19.804778 4850.420531 1522.431887 4023.016484 1096.696416 165.105360 677.071454 16.328155 14.722359 3.958349 12.391801 5221.768440 17.17771
min 81.000000 72.000000 35.000000 1.000000 9.000000 139.000000 1.000000 2340.000000 1780.000000 96.000000 250.000000 8.000000 24.000000 2.500000 0.000000 3186.000000 10.00000
25% 776.000000 604.000000 242.000000 15.000000 41.000000 992.000000 95.000000 7320.000000 3597.000000 470.000000 850.000000 62.000000 71.000000 11.500000 13.000000 6751.000000 53.00000
50% 1558.000000 1110.000000 434.000000 23.000000 54.000000 1707.000000 353.000000 9990.000000 4200.000000 500.000000 1200.000000 75.000000 82.000000 13.600000 21.000000 8377.000000 65.00000
75% 3624.000000 2424.000000 902.000000 35.000000 69.000000 4005.000000 967.000000 12925.000000 5050.000000 600.000000 1700.000000 85.000000 92.000000 16.500000 31.000000 10830.000000 78.00000
max 48094.000000 26330.000000 6392.000000 96.000000 100.000000 31643.000000 21836.000000 21700.000000 8124.000000 2340.000000 6800.000000 103.000000 100.000000 39.800000 64.000000 56233.000000 118.00000
In [39]:
data_scaled.describe()
Out[39]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02
mean 6.355797e-17 6.774575e-17 -5.249269e-17 -2.753232e-17 -1.546739e-16 -1.661405e-16 -3.029180e-17 6.515595e-17 3.570717e-16 -2.192583e-16 4.765243e-17 5.954768e-17 -4.481615e-16 -2.057556e-17 -6.022638e-17 1.213101e-16 3.886495e-16
std 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00
min -7.551337e-01 -7.947645e-01 -8.022728e-01 -1.506526e+00 -2.364419e+00 -7.346169e-01 -5.615022e-01 -2.014878e+00 -2.351778e+00 -2.747779e+00 -1.611860e+00 -3.962596e+00 -3.785982e+00 -2.929799e+00 -1.836580e+00 -1.240641e+00 -3.230876e+00
25% -5.754408e-01 -5.775805e-01 -5.793514e-01 -7.123803e-01 -7.476067e-01 -5.586426e-01 -4.997191e-01 -7.762035e-01 -6.939170e-01 -4.810994e-01 -7.251203e-01 -6.532948e-01 -5.915023e-01 -6.546598e-01 -7.868237e-01 -5.574826e-01 -7.260193e-01
50% -3.732540e-01 -3.710108e-01 -3.725836e-01 -2.585828e-01 -9.077663e-02 -4.111378e-01 -3.301442e-01 -1.120949e-01 -1.437297e-01 -2.992802e-01 -2.078552e-01 1.433889e-01 1.561419e-01 -1.237939e-01 -1.408197e-01 -2.458933e-01 -2.698956e-02
75% 1.609122e-01 1.654173e-01 1.314128e-01 4.221134e-01 6.671042e-01 6.294077e-02 7.341765e-02 6.179271e-01 6.318245e-01 3.067838e-01 5.310950e-01 7.562224e-01 8.358184e-01 6.093067e-01 6.666852e-01 2.241735e-01 7.302926e-01
max 1.165867e+01 9.924816e+00 6.043678e+00 3.882319e+00 2.233391e+00 5.764674e+00 1.378992e+01 2.800531e+00 3.436593e+00 1.085230e+01 8.068387e+00 1.859323e+00 1.379560e+00 6.499390e+00 3.331452e+00 8.924721e+00 3.060392e+00
In [40]:
# Method II
## Using standardScaler for Standardisation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(new_df[num])
data_standard=scaler.transform(new_df[num])
data_standard=pd.DataFrame(data_standard, columns=new_df[num].columns)
data_standard.describe()
Out[40]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02 7.770000e+02
mean 6.355797e-17 6.774575e-17 -5.249269e-17 -2.753232e-17 -1.546739e-16 -1.661405e-16 -3.029180e-17 6.515595e-17 3.570717e-16 -2.192583e-16 4.765243e-17 5.954768e-17 -4.481615e-16 -2.057556e-17 -6.022638e-17 1.213101e-16 3.886495e-16
std 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00 1.000644e+00
min -7.551337e-01 -7.947645e-01 -8.022728e-01 -1.506526e+00 -2.364419e+00 -7.346169e-01 -5.615022e-01 -2.014878e+00 -2.351778e+00 -2.747779e+00 -1.611860e+00 -3.962596e+00 -3.785982e+00 -2.929799e+00 -1.836580e+00 -1.240641e+00 -3.230876e+00
25% -5.754408e-01 -5.775805e-01 -5.793514e-01 -7.123803e-01 -7.476067e-01 -5.586426e-01 -4.997191e-01 -7.762035e-01 -6.939170e-01 -4.810994e-01 -7.251203e-01 -6.532948e-01 -5.915023e-01 -6.546598e-01 -7.868237e-01 -5.574826e-01 -7.260193e-01
50% -3.732540e-01 -3.710108e-01 -3.725836e-01 -2.585828e-01 -9.077663e-02 -4.111378e-01 -3.301442e-01 -1.120949e-01 -1.437297e-01 -2.992802e-01 -2.078552e-01 1.433889e-01 1.561419e-01 -1.237939e-01 -1.408197e-01 -2.458933e-01 -2.698956e-02
75% 1.609122e-01 1.654173e-01 1.314128e-01 4.221134e-01 6.671042e-01 6.294077e-02 7.341765e-02 6.179271e-01 6.318245e-01 3.067838e-01 5.310950e-01 7.562224e-01 8.358184e-01 6.093067e-01 6.666852e-01 2.241735e-01 7.302926e-01
max 1.165867e+01 9.924816e+00 6.043678e+00 3.882319e+00 2.233391e+00 5.764674e+00 1.378992e+01 2.800531e+00 3.436593e+00 1.085230e+01 8.068387e+00 1.859323e+00 1.379560e+00 6.499390e+00 3.331452e+00 8.924721e+00 3.060392e+00
In [41]:
# Method III Min-Max method
from sklearn.preprocessing import MinMaxScaler
# build the scaler model
scaler = MinMaxScaler().fit(new_df[num])
# transform the data
data_minmax = scaler.transform(new_df[num])
data_minmax=pd.DataFrame(data_minmax, columns=new_df[num].columns)
data_minmax.describe()
Out[41]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000
mean 0.060830 0.074141 0.117189 0.279564 0.514249 0.113030 0.039125 0.418423 0.406294 0.202041 0.166510 0.680634 0.732930 0.310716 0.355373 0.122046 0.513549
std 0.080607 0.093347 0.146166 0.185688 0.217635 0.153962 0.069724 0.207800 0.172871 0.073576 0.103370 0.171875 0.193715 0.106122 0.193622 0.098437 0.159053
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.014475 0.020260 0.032563 0.147368 0.351648 0.027076 0.004305 0.257231 0.286412 0.166667 0.091603 0.568421 0.618421 0.241287 0.203125 0.067205 0.398148
50% 0.030763 0.039531 0.062765 0.231579 0.494505 0.049771 0.016121 0.395145 0.381463 0.180036 0.145038 0.705263 0.763158 0.297587 0.328125 0.097857 0.509259
75% 0.073793 0.089573 0.136385 0.357895 0.659341 0.122715 0.044241 0.546746 0.515448 0.224599 0.221374 0.810526 0.894737 0.375335 0.484375 0.144099 0.629630
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Applying zscore or using StandardScaler gives us the same results.

Both scale the data so that the mean of each feature tends to 0 and the standard deviation tends to 1.

The Min-Max method ensures that the data is scaled to values in the range 0 to 1.
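The two formulas can be checked by hand on toy data (a minimal numpy-only sketch; the array here is hypothetical, not the college dataset). The manual z-score below uses the population standard deviation (ddof=0), which is the same convention scipy.stats.zscore and StandardScaler use by default, and the manual min-max matches what MinMaxScaler computes.

```python
import numpy as np

X = np.array([10.0, 20.0, 30.0, 45.0])  # toy feature column

# z-score: subtract the mean, divide by the population std (ddof=0)
z = (X - X.mean()) / X.std()

# min-max: shift the minimum to 0 and divide by the range
mm = (X - X.min()) / (X.max() - X.min())

print(z.mean(), z.std())   # mean ~0, std 1
print(mm.min(), mm.max())  # 0.0 and 1.0
```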

2.5 Perform PCA and export the data of the Principal Component scores into a data frame.

In [42]:
from sklearn.decomposition import PCA
In [43]:
pca = PCA(n_components = 1)
In [44]:
data_reduced_PC1 = pca.fit_transform(new_df)
data_reduced_PC1
Out[44]:
array([[-2.55183786e+03],
       [-7.43729533e+02],
       [-3.37355621e+03],
       [-1.43682677e+03],
       [-4.41383950e+03],
       [-3.93597892e+03],
       [-4.45329067e+03],
       [-1.53563613e+03],
       [-2.62197055e+03],
       [-4.19874638e+03],
       [-1.57067598e+03],
       [-7.52616774e+02],
       [-3.79665811e+03],
       [-3.04488699e+03],
       [-4.18426676e+03],
       [-3.89363354e+03],
       [ 2.79734289e+03],
       [-3.28808694e+03],
       [-2.73099569e+03],
       [-1.22440720e+03],
       [ 6.05539116e+03],
       [ 6.32161375e+03],
       [-4.43321081e+03],
       [ 2.05385441e+04],
       [-2.82929172e+03],
       [-2.72036280e+03],
       [-2.62293514e+03],
       [ 1.19006303e+04],
       [-3.46141380e+03],
       [-2.03819740e+03],
       [-4.23058154e+03],
       [-2.91690236e+03],
       [-4.85679612e+03],
       [-4.94451353e+03],
       [-1.99876498e+03],
       [-4.64858580e+03],
       [-1.22183581e+03],
       [-3.12754801e+02],
       [-2.70952205e+03],
       [ 6.34343367e+03],
       [-3.47375362e+03],
       [-4.44920249e+03],
       [-4.67728662e+03],
       [-3.35742787e+03],
       [-1.97283008e+03],
       [-3.45983124e+03],
       [-5.11112625e+03],
       [-1.77633201e+03],
       [ 5.99312757e+02],
       [-2.30897231e+03],
       [-4.45395740e+03],
       [-5.00401534e+03],
       [-5.26659319e+03],
       [-3.34823783e+03],
       [-3.65598259e+03],
       [-4.99777239e+03],
       [ 2.99214594e+03],
       [-5.41316217e+03],
       [-4.16420398e+03],
       [ 2.41981001e+04],
       [ 1.88861973e+03],
       [ 1.13860749e+04],
       [-2.44309579e+03],
       [ 9.35637691e+02],
       [ 2.96839924e+03],
       [-5.16995773e+03],
       [-4.45907125e+03],
       [-4.55794639e+03],
       [-4.01951964e+03],
       [ 1.90496591e+04],
       [ 1.07891864e+04],
       [-4.14975828e+02],
       [ 3.88863418e+03],
       [-3.79542442e+03],
       [-1.05369457e+03],
       [-4.48312899e+03],
       [-4.28353008e+03],
       [-3.66501201e+03],
       [ 9.16703476e+03],
       [ 7.16857789e+03],
       [-1.69938456e+03],
       [-3.01869456e+03],
       [-5.21530875e+03],
       [-1.01037886e+03],
       [-2.22835784e+03],
       [-5.36571667e+03],
       [ 1.19955026e+03],
       [ 9.41672496e+03],
       [-3.53326206e+03],
       [-3.87384168e+03],
       [-3.13424359e+03],
       [ 3.64829923e+03],
       [-4.04294411e+03],
       [-3.90257054e+03],
       [-8.91485460e+02],
       [-1.60522755e+03],
       [-3.15421840e+03],
       [-3.36677023e+03],
       [-4.35530499e+03],
       [-4.15933257e+03],
       [-3.08174058e+03],
       [-3.34980009e+03],
       [ 1.33786829e+03],
       [ 3.22754642e+03],
       [ 6.48796579e+02],
       [-6.19851461e+03],
       [-2.43198226e+03],
       [-1.80127744e+03],
       [-2.45819454e+03],
       [-4.94540286e+03],
       [-4.64089964e+03],
       [-4.14669854e+03],
       [-3.67766844e+03],
       [-4.58060977e+03],
       [-1.60246741e+02],
       [-2.98983769e+02],
       [-4.72187096e+03],
       [-5.97230575e+02],
       [ 8.79135241e+03],
       [-5.35759773e+03],
       [-3.06651862e+03],
       [-4.33554157e+03],
       [ 5.43555126e+02],
       [ 2.61780874e+03],
       [-3.35289796e+03],
       [ 1.92248106e+03],
       [-4.29069144e+03],
       [-3.54366357e+03],
       [-4.17711631e+03],
       [-4.55850413e+03],
       [-3.16060698e+03],
       [-3.99643952e+03],
       [-4.24521738e+03],
       [-4.03065431e+03],
       [-4.12291937e+03],
       [-5.67186985e+03],
       [-3.63916081e+03],
       [ 2.70662594e+02],
       [ 3.66294659e+03],
       [-3.94425034e+02],
       [ 4.38677197e+02],
       [ 1.27680337e+04],
       [-5.25345855e+03],
       [-4.01174261e+03],
       [ 8.39154014e+03],
       [-4.87554743e+03],
       [-5.47000930e+03],
       [-4.37354444e+03],
       [-4.56160505e+03],
       [ 3.39413786e+02],
       [-3.76499569e+03],
       [-2.46838881e+03],
       [ 3.98683570e+03],
       [-4.37389997e+03],
       [-3.98468679e+03],
       [-4.46833758e+03],
       [-4.58550256e+03],
       [-4.36002117e+03],
       [ 9.67332506e+03],
       [ 4.40924534e+02],
       [-4.71893816e+03],
       [-3.91195523e+03],
       [-2.63934913e+02],
       [-1.19977820e+03],
       [ 5.45066827e+02],
       [-5.55248409e+03],
       [-3.15419464e+03],
       [-4.51700264e+03],
       [-5.41509346e+03],
       [-4.50185827e+03],
       [-2.83379402e+03],
       [-5.19089774e+01],
       [-4.59134543e+02],
       [-3.91138469e+03],
       [ 1.39636414e+04],
       [-1.53810468e+03],
       [ 1.12711288e+04],
       [ 2.48687212e+03],
       [-5.49118264e+03],
       [-4.29342157e+03],
       [-2.64249458e+03],
       [ 4.53307351e+03],
       [-3.89210826e+03],
       [-4.83841930e+03],
       [-1.09333837e+03],
       [-1.74766875e+03],
       [-3.41445647e+03],
       [-4.52602486e+03],
       [-1.00005456e+03],
       [ 1.83919572e+03],
       [-4.38699502e+03],
       [ 1.09610746e+04],
       [-2.59299837e+03],
       [-4.74703360e+03],
       [-4.28660753e+03],
       [-2.11490778e+03],
       [ 1.74887882e+03],
       [-3.03036342e+03],
       [-3.71411473e+03],
       [-5.04667009e+03],
       [-2.10201577e+03],
       [ 4.44847149e+03],
       [-3.85979515e+03],
       [ 1.69623300e+04],
       [-5.68704771e+03],
       [ 2.41597161e+03],
       [-1.19538362e+03],
       [-2.77005553e+03],
       [-4.67913170e+03],
       [-3.17711937e+03],
       [ 7.50323652e+00],
       [-4.87031537e+03],
       [-4.76117836e+03],
       [-3.58201382e+02],
       [-1.58721080e+03],
       [-4.26479955e+03],
       [-4.44417539e+03],
       [-4.09211472e+03],
       [ 5.81653508e+03],
       [ 7.04085786e+03],
       [-4.43792098e+03],
       [ 9.74763023e+03],
       [ 7.22881419e+03],
       [ 3.46978295e+03],
       [-4.69324179e+03],
       [ 1.17381979e+03],
       [-5.15236229e+03],
       [-1.54144485e+03],
       [-3.70933489e+03],
       [-3.90950164e+03],
       [-1.14934856e+03],
       [-4.21846331e+03],
       [-4.95065522e+03],
       [ 3.58669320e+03],
       [-4.92208644e+03],
       [-4.68211331e+03],
       [-4.52052449e+03],
       [ 5.18440205e+02],
       [-3.43760854e+03],
       [-2.97544904e+03],
       [-1.76336069e+03],
       [-4.73181136e+03],
       [ 1.36604297e+03],
       [-2.92018236e+03],
       [-2.79011370e+03],
       [ 2.59347054e+03],
       [-3.74252680e+03],
       [-4.65581816e+03],
       [-2.73347090e+03],
       [-1.28859504e+03],
       [ 1.68420745e+04],
       [ 2.88931988e+02],
       [-4.44188086e+03],
       [-4.10670970e+03],
       [-2.56815132e+03],
       [-2.62285390e+03],
       [-7.82290766e+01],
       [ 6.05973436e+03],
       [-2.52933100e+03],
       [-2.88510853e+03],
       [-1.81388984e+03],
       [-3.87123397e+03],
       [-4.86085468e+03],
       [-4.40124277e+03],
       [-4.68180465e+03],
       [-5.19638407e+03],
       [-3.99088978e+03],
       [-4.46263334e+03],
       [-1.03632742e+03],
       [ 1.18170725e+04],
       [-1.42838691e+03],
       [-5.01643488e+03],
       [-3.92617893e+03],
       [ 5.18054390e+03],
       [ 2.61466291e+04],
       [-3.72461662e+03],
       [ 1.42493728e+03],
       [ 1.48487859e+04],
       [ 5.18121243e+03],
       [ 8.41962725e+03],
       [-3.93036844e+01],
       [-8.23679499e+02],
       [-4.57409138e+03],
       [-1.17396963e+03],
       [ 1.76886819e+04],
       [-4.52470065e+03],
       [-4.97726589e+03],
       [-2.50170458e+03],
       [ 9.01596131e+03],
       [-5.21423042e+03],
       [-1.10041689e+03],
       [-4.97511506e+03],
       [-5.29685432e+02],
       [-4.71813781e+03],
       [-3.95863850e+03],
       [-4.67917820e+03],
       [-2.17854231e+03],
       [-5.04009344e+03],
       [-6.52814515e+02],
       [ 1.62499925e+03],
       [-4.52476725e+03],
       [-1.49263304e+03],
       [-5.48975561e+03],
       [-3.70332023e+02],
       [-5.29027755e+03],
       [-3.92469974e+03],
       [-1.61699136e+03],
       [-2.89285428e+03],
       [-3.40157806e+03],
       [ 4.99471801e+03],
       [-3.76428023e+03],
       [-4.23691743e+03],
       [-4.48586664e+03],
       [-2.54140733e+02],
       [-2.92934785e+03],
       [-5.28326095e+03],
       [-2.86880103e+03],
       [-4.71387773e+03],
       [-2.69832865e+03],
       [-4.42269938e+03],
       [-1.07094826e+03],
       [-2.16733502e+03],
       [-3.06971144e+03],
       [-3.84644663e+03],
       [ 1.06321700e+04],
       [ 3.41424853e+01],
       [ 5.68418771e+02],
       [ 1.31253007e+03],
       [-1.38564424e+03],
       [ 3.06126453e+03],
       [-2.19631248e+03],
       [-3.31703940e+03],
       [-2.56433161e+03],
       [-4.81507064e+03],
       [-3.93364225e+01],
       [-3.77873082e+03],
       [-4.61871325e+03],
       [-4.08229599e+03],
       [-9.60794830e+02],
       [-2.90240944e+03],
       [ 2.95239738e+03],
       [-4.81554020e+03],
       [-2.69588747e+03],
       [ 9.42722197e+02],
       [ 4.61515002e+03],
       [ 2.95501241e+03],
       [-3.51757842e+03],
       [-1.12099196e+03],
       [-3.76588962e+03],
       [-3.35955110e+03],
       [-3.56111152e+03],
       [-3.82029728e+03],
       [-4.64266485e+03],
       [-3.23659587e+03],
       [ 9.97761664e+03],
       [-5.52607718e+03],
       [-4.70978947e+03],
       [-5.09241541e+03],
       [-4.56359399e+03],
       [-1.12585396e+03],
       [-3.24039052e+03],
       [-4.09533253e+03],
       [-2.05386318e+03],
       [-3.10359326e+03],
       [-2.54703538e+03],
       [ 1.18160860e+04],
       [ 2.96410549e+04],
       [ 3.07620031e+02],
       [-5.30736179e+03],
       [ 2.41742157e+03],
       [-4.67134772e+03],
       [-2.90958912e+03],
       [-2.97125145e+03],
       [-3.33534957e+03],
       [-4.67584061e+03],
       [ 4.54600383e+03],
       [-4.90611478e+03],
       [-2.90890294e+03],
       [-4.86283619e+03],
       [-3.53489875e+03],
       [-1.36519837e+03],
       [-4.77334592e+03],
       [ 2.83993213e+03],
       [ 2.13396948e+03],
       [-4.02018800e+03],
       [-4.61642475e+02],
       [-2.87704799e+03],
       [-1.00045378e+03],
       [-4.32265083e+03],
       [-4.78049594e+03],
       [ 8.04161390e+02],
       [-5.93102171e+03],
       [-5.04667175e+03],
       [-4.98020309e+03],
       [-5.47582929e+03],
       [-3.18130465e+03],
       [-4.25822344e+03],
       [-3.30847386e+03],
       [-3.59784220e+03],
       [-4.82696079e+03],
       [-1.08807999e+03],
       [-3.20959489e+02],
       [-3.27620886e+03],
       [-3.47309153e+03],
       [-3.52478889e+03],
       [-3.23608087e+02],
       [-4.08481807e+03],
       [ 1.78569757e+04],
       [-5.08067966e+03],
       [-2.36432018e+03],
       [-4.04315121e+03],
       [ 2.09808957e+03],
       [ 1.53713674e+04],
       [-3.89123891e+03],
       [-3.17537599e+03],
       [ 1.41259914e+03],
       [-3.57017520e+03],
       [ 2.69259025e+03],
       [ 1.37079952e+04],
       [ 6.82080107e+03],
       [ 1.26844136e+04],
       [-9.01073654e+02],
       [-4.80885127e+03],
       [-4.56251636e+03],
       [ 1.41073477e+04],
       [-2.05891523e+03],
       [-5.74860328e+03],
       [ 1.47139712e+03],
       [ 2.97887924e+03],
       [ 1.49227115e+02],
       [-3.92352583e+03],
       [-4.51801120e+02],
       [ 1.41109476e+04],
       [-9.30516163e+02],
       [-4.63316424e+03],
       [-4.47566807e+03],
       [ 6.68826072e+03],
       [-2.66083149e+03],
       [-4.53745469e+03],
       [-3.21649280e+03],
       [ 5.59675257e+03],
       [-1.74686463e+03],
       [-3.41492230e+03],
       [-2.84638144e+03],
       [-3.88966332e+03],
       [ 2.92150056e+04],
       [ 1.99955902e+03],
       [-5.62559342e+03],
       [-4.57862775e+03],
       [-2.84531473e+03],
       [-4.24975680e+03],
       [-5.01891278e+03],
       [-5.62195982e+03],
       [-1.72667260e+03],
       [-3.56361312e+03],
       [-3.95411942e+03],
       [-1.43604660e+03],
       [-1.12303058e+03],
       [-3.68001698e+03],
       [ 1.22574355e+04],
       [ 1.56285774e+03],
       [ 3.20303485e+04],
       [-4.16257132e+03],
       [-4.31483306e+03],
       [-3.72893222e+02],
       [ 3.79478057e+03],
       [-2.43958389e+03],
       [-2.27046818e+03],
       [-1.79134472e+03],
       [-2.49335362e+02],
       [-3.44861603e+03],
       [ 4.34834989e+03],
       [-8.06613040e+02],
       [ 5.77118008e+02],
       [-2.96786361e+03],
       [-4.97571665e+03],
       [-2.03877788e+03],
       [-4.03011302e+03],
       [-5.14238029e+03],
       [-9.16327772e+02],
       [-1.47960057e+03],
       [-4.32138206e+03],
       [ 6.22715631e+02],
       [ 4.63436778e+04],
       [-6.59521879e+02],
       [ 2.57598624e+03],
       [-2.37128835e+03],
       [-3.97480322e+03],
       [-2.70902430e+03],
       [ 4.66361880e+03],
       [-5.26271434e+03],
       [-3.91135434e+03],
       [-2.92381101e+03],
       [-4.34366151e+03],
       [-5.17281372e+03],
       [-1.26947606e+03],
       [-4.54106133e+03],
       [ 3.84796103e+03],
       [-2.71242318e+03],
       [-4.73061927e+03],
       [-4.59828832e+03],
       [-1.95925352e+03],
       [-1.62872703e+02],
       [-2.80637290e+03],
       [-4.03369910e+03],
       [-3.33885071e+03],
       [-4.38602730e+03],
       [-4.31174764e+03],
       [-4.21145391e+02],
       [-1.68169103e+03],
       [ 1.39531891e+04],
       [ 1.46336341e+03],
       [-1.26383834e+03],
       [-3.75287477e+03],
       [-4.56503206e+03],
       [-9.60855657e+02],
       [-2.27646887e+03],
       [-9.42696369e+02],
       [ 2.23902025e+03],
       [-3.52922962e+03],
       [ 2.02211196e+03],
       [-4.67216924e+03],
       [-1.18518406e+03],
       [-4.28448237e+03],
       [-1.71848308e+03],
       [-3.92200656e+03],
       [-4.82873726e+03],
       [ 1.95026286e+03],
       [ 2.77378094e+03],
       [ 7.00186479e+02],
       [ 1.68234784e+02],
       [-4.18642562e+03],
       [-4.62120446e+03],
       [ 6.97314010e+02],
       [ 3.07657203e+03],
       [-4.37684620e+03],
       [ 7.94643790e+03],
       [-3.88348418e+03],
       [-5.73571573e+03],
       [-5.25140822e+03],
       [-2.14657360e+03],
       [-5.37283666e+03],
       [-1.19494119e+03],
       [-3.41605496e+03],
       [-2.51082495e+03],
       [-2.42078509e+03],
       [-3.14940891e+03],
       [ 4.30828940e+02],
       [-4.57855973e+03],
       [-9.67346263e+02],
       [-3.10805240e+03],
       [-3.15189003e+03],
       [-2.70847863e+03],
       [-4.73447471e+03],
       [-4.45686428e+03],
       [-3.84973339e+03],
       [-2.06925213e+03],
       [-1.42745850e+03],
       [-7.57188929e+02],
       [-1.18458517e+03],
       [ 1.24982600e+04],
       [ 1.04982498e+04],
       [ 1.70251971e+04],
       [ 1.22261935e+04],
       [ 3.27019385e+03],
       [ 4.45411916e+03],
       [ 3.75897923e+03],
       [ 2.69605598e+03],
       [ 3.92766871e+02],
       [ 3.49432372e+03],
       [ 3.14951592e+03],
       [ 1.56775305e+03],
       [-1.11720348e+03],
       [-2.19394466e+03],
       [-1.74186233e+03],
       [-1.30751892e+03],
       [ 1.20991368e+04],
       [-5.43679453e+03],
       [-2.78358700e+03],
       [-2.82585391e+03],
       [-5.70094186e+03],
       [ 2.83822760e+04],
       [-5.04070261e+03],
       [ 1.66580751e+03],
       [-4.89037883e+03],
       [ 9.92228495e+02],
       [-4.40045860e+03],
       [-3.12902076e+03],
       [-3.57960955e+03],
       [-5.31920367e+03],
       [-3.57317052e+03],
       [ 1.65702565e+03],
       [-3.91925485e+03],
       [ 1.46163767e+03],
       [-3.82681355e+03],
       [-4.45239040e+03],
       [ 1.13373043e+01],
       [ 6.91798816e+03],
       [-5.46400525e+03],
       [-1.03816035e+03],
       [-5.21191434e+03],
       [ 1.03896776e+03],
       [ 2.85868170e+03],
       [ 3.37066419e+03],
       [ 3.82682852e+03],
       [ 2.40126960e+04],
       [ 1.83190664e+04],
       [ 8.12132633e+03],
       [-4.54174942e+03],
       [ 1.03643377e+04],
       [ 9.80883192e+03],
       [ 1.19520410e+04],
       [-3.92161777e+03],
       [ 4.36419125e+03],
       [ 1.73250113e+04],
       [ 1.10173710e+02],
       [-2.13304728e+03],
       [-4.34022905e+03],
       [-1.95642920e+03],
       [ 2.34989088e+04],
       [ 1.69915020e+04],
       [ 2.56149976e+03],
       [ 6.28326630e+03],
       [ 2.47529076e+04],
       [ 1.24749178e+04],
       [-3.42066638e+03],
       [ 1.41288741e+04],
       [-2.31451190e+03],
       [ 6.14874827e+03],
       [-4.08411226e+03],
       [-5.53651699e+03],
       [-5.09805110e+03],
       [ 2.32313625e+03],
       [ 2.00985254e+04],
       [ 1.90025229e+04],
       [ 3.21325850e+02],
       [ 9.00553662e+03],
       [ 2.75052406e+04],
       [ 1.63367822e+03],
       [-3.52769574e+03],
       [ 1.91073316e+04],
       [ 2.49462336e+03],
       [ 1.07492013e+04],
       [-6.52569496e+02],
       [-1.03342043e+03],
       [-5.16626488e+03],
       [-3.76844505e+03],
       [ 1.12602653e+04],
       [-3.39505174e+03],
       [ 9.98383349e+03],
       [-3.25824762e+03],
       [ 1.72145328e+04],
       [ 5.88216995e+03],
       [ 3.94326742e+03],
       [ 3.19304018e+03],
       [ 2.85623426e+03],
       [-1.76146814e+03],
       [ 7.08526413e+03],
       [ 4.37107962e+03],
       [ 4.40737898e+03],
       [ 7.52276025e+03],
       [ 8.45924648e+03],
       [ 9.94496491e+03],
       [ 1.53819644e+04],
       [ 1.27840072e+04],
       [-2.21395285e+03],
       [ 9.12390639e+02],
       [ 9.38738602e+03],
       [ 2.12046384e+03],
       [ 1.06016274e+04],
       [ 1.27742408e+03],
       [ 2.12601949e+02],
       [-6.07601310e+03],
       [ 1.04161605e+03],
       [-4.57529056e+03],
       [ 9.87991573e+03],
       [ 1.23665051e+04],
       [ 1.66895266e+04],
       [-2.58902330e+03],
       [-1.57830172e+03],
       [ 2.26055419e+03],
       [-5.85170345e+02],
       [-4.83031883e+03],
       [ 1.17430249e+04],
       [ 4.35832955e+03],
       [ 2.70799508e+04],
       [ 3.60719150e+03],
       [-2.78762122e+03],
       [ 4.37780373e+02],
       [-1.92181078e+02],
       [-7.48105441e+02],
       [ 9.62927233e+03],
       [ 7.80794608e+03],
       [ 1.47964914e+04],
       [ 2.09877005e+04],
       [-1.47987657e+03],
       [ 8.07007028e+01],
       [-4.32085843e+03],
       [ 2.84125786e+03],
       [-1.54831297e+03],
       [ 2.42761758e+04],
       [ 7.44114301e+03],
       [ 1.42870659e+03],
       [-5.24440080e+03],
       [-2.15188910e+03],
       [-5.01499576e+03],
       [-5.84831498e+03],
       [-1.49206882e+03],
       [ 9.42225207e+03],
       [ 1.79407488e+03],
       [ 6.38681853e+03],
       [ 7.05778724e+03],
       [-1.80716745e+03],
       [ 2.05787860e+04],
       [-3.55799698e+03],
       [-3.59290891e+03],
       [-4.30288415e+03],
       [-4.88348319e+03],
       [-2.16993613e+03],
       [-2.92820888e+03],
       [ 1.09887546e+04],
       [-4.69527680e+03],
       [-4.50572387e+03],
       [-3.60021281e+03],
       [-2.71021341e+03],
       [ 2.54863416e+02],
       [-2.85563311e+03],
       [ 1.10466986e+04],
       [ 1.56348037e+04],
       [-3.77614201e+03],
       [-4.15950107e+03],
       [-6.30734355e+03],
       [-4.03519809e+03],
       [ 2.46106377e+03],
       [-2.98319906e+03],
       [ 7.22609628e+02],
       [-4.56526256e+03],
       [ 2.65556128e+03],
       [ 4.23033173e+03],
       [-4.46772173e+03],
       [-2.87744358e+03],
       [ 6.39787852e+00],
       [-2.91319912e+03],
       [ 1.26743961e+04],
       [-2.63034487e+03],
       [-3.06402832e+03],
       [ 4.68253312e+03],
       [-2.07301700e+03],
       [-4.51008294e+03],
       [-3.45149732e+03],
       [-4.33343827e+03],
       [-3.37129540e+03],
       [-1.66777634e+03],
       [ 1.63626517e+03],
       [-4.21493038e+03],
       [-1.41794973e+03],
       [-1.87565090e+03],
       [-3.42541154e+03],
       [-1.45155861e+03],
       [-2.58025229e+03],
       [-1.96735409e+03],
       [-4.13602721e+03],
       [-4.61533689e+03],
       [ 3.27194815e+03],
       [-4.45957770e+03],
       [-3.99826323e+03],
       [ 3.83622148e+00],
       [-1.76820164e+03],
       [-5.01714150e+03],
       [-1.41519197e+03],
       [-3.54341723e+03],
       [-8.85287828e+01],
       [-2.66784428e+03],
       [-1.27192045e+03],
       [-1.83820685e+03],
       [ 1.50231866e+04],
       [-2.28658223e+03]])
In [45]:
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
var
Out[45]:
array([0.46359217])
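A single component explaining only ~46% of the variance suggests inspecting the cumulative explained-variance ratio before fixing n_components. The ratios are just the covariance-matrix eigenvalues divided by their sum, so this can be sketched in plain numpy on toy data (the random matrix below is a hypothetical stand-in for new_df):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data

# eigenvalues of the covariance matrix, sorted in descending order
eig_vals = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]

ratio = eig_vals / eig_vals.sum()    # same as pca.explained_variance_ratio_
cum = np.cumsum(ratio)
k = int(np.argmax(cum >= 0.90)) + 1  # smallest k covering >= 90% of variance
print(k, cum[k - 1])
```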
In [46]:
# Note: PCA here is fit on the unscaled data (data_scaled is just new_df);
# the commented lines show how the data could be standardised first
#from sklearn import preprocessing
#data_scaled = pd.DataFrame(preprocessing.scale(bat_PCA),columns = bat_PCA.columns) 
data_scaled = new_df 

# PCA
pca = PCA(n_components=1)
pca.fit_transform(data_scaled)

# Dump components relations with features:
df_PC =pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1'])
In [47]:
pca.components_
Out[47]:
array([[ 5.57026265e-01,  3.47711968e-01,  1.29854039e-01,
         1.02538882e-03,  1.17742114e-03,  6.70614019e-01,
         1.11112714e-01,  5.48419377e-02,  2.88655215e-02,
         3.73421983e-03,  2.31322104e-02,  1.13882279e-03,
         9.89571300e-04,  2.85990622e-05, -1.09332895e-04,
         2.92390370e-01,  3.20352937e-04]])
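The component vector in Out[47] is the direction of maximum variance, and the scores in Out[44] are the mean-centred data projected onto it. A numpy-only sketch on toy data (hypothetical, standing in for new_df) shows the relationship, and why the score variance equals the top eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([10.0, 2.0, 0.5])  # unequal variances

# PC1 direction = eigenvector of the covariance matrix with the largest eigenvalue
vals, vecs = np.linalg.eigh(np.cov(X.T))
pc1 = vecs[:, np.argmax(vals)]

# PC1 scores: centre the data, then project it onto that direction
scores = (X - X.mean(axis=0)) @ pc1

print(scores.mean())                   # ~0: scores are centred
print(scores.var(ddof=1), vals.max())  # equal: score variance = top eigenvalue
```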

2.6 Extract the eigenvalues and eigenvectors.

In [61]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(new_df)
kmo_model
Out[61]:
0.8131251200373506
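A KMO of ~0.81 indicates the data is well suited to PCA/factor analysis. Kaiser's conventional rule-of-thumb bands can be encoded in a small helper (this function is our own illustration, not part of factor_analyzer):

```python
def kmo_adequacy(kmo):
    """Label a Kaiser-Meyer-Olkin value using Kaiser's rule-of-thumb bands."""
    if kmo >= 0.9: return "marvelous"
    if kmo >= 0.8: return "meritorious"
    if kmo >= 0.7: return "middling"
    if kmo >= 0.6: return "mediocre"
    if kmo >= 0.5: return "miserable"
    return "unacceptable"

print(kmo_adequacy(0.813))  # the value above falls in the "meritorious" band
```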
In [52]:
cov_matrix = np.cov(new_df.T)
print('Covariance Matrix \n', cov_matrix)
Covariance Matrix 
 [[ 1.49784595e+07  8.94985981e+06  3.04525599e+06  2.31327731e+04
   2.69526635e+04  1.52897025e+07  2.34662015e+06  7.80970356e+05
   7.00072872e+05  8.47037526e+04  4.68346833e+05  2.46894337e+04
   2.10530676e+04  1.46506058e+03 -4.32712238e+03  5.24617110e+06
   9.75642164e+03]
 [ 8.94985981e+06  6.00795970e+06  2.07626776e+06  8.32112487e+03
   1.20134048e+04  1.03935824e+07  1.64666972e+06 -2.53962285e+05
   2.44347147e+05  4.59428079e+04  3.33556631e+05  1.42382015e+04
   1.21820938e+04  1.70983819e+03 -4.85948702e+03  1.59627169e+06
   2.83416292e+03]
 [ 3.04525599e+06  2.07626776e+06  8.63368392e+05  2.97158341e+03
   4.17259244e+03  4.34752988e+06  7.25790674e+05 -5.81188483e+05
  -4.09970592e+04  1.72911997e+04  1.76737970e+05  5.02896117e+03
   4.21708603e+03  8.72684773e+02 -2.08169379e+03  3.11345431e+05
  -3.56587977e+02]
 [ 2.31327731e+04  8.32112487e+03  2.97158341e+03  3.11182456e+02
   3.11630480e+02  1.20891137e+04 -2.82947498e+03  3.99071798e+04
   7.18670561e+03  3.46177405e+02 -1.11455119e+03  1.53184870e+02
   1.27551581e+02 -2.68745252e+01  9.95672077e+01  6.08793102e+04
   1.49992164e+02]
 [ 2.69526635e+04  1.20134048e+04  4.17259244e+03  3.11630480e+02
   3.92229216e+02  1.91589528e+04 -1.61541214e+03  3.89924275e+04
   7.19990357e+03  3.77759266e+02 -1.08360506e+03  1.76518449e+02
   1.53002612e+02 -2.30971994e+01  1.02550946e+02  5.45464833e+04
   1.62371398e+02]
 [ 1.52897025e+07  1.03935824e+07  4.34752988e+06  1.20891137e+04
   1.91589528e+04  2.35265793e+07  4.21291009e+06 -4.20984304e+06
  -3.66458224e+05  9.25357647e+04  1.04170909e+06  2.52117842e+04
   2.14242417e+04  5.37020858e+03 -1.37919297e+04  4.72403958e+05
  -6.56330753e+03]
 [ 2.34662015e+06  1.64666972e+06  7.25790674e+05 -2.82947498e+03
  -1.61541214e+03  4.21291009e+06  2.31779885e+06 -1.55270428e+06
  -1.02391862e+05  2.04104467e+04  3.29732427e+05  3.70675622e+03
   3.18059661e+03  1.40130256e+03 -5.29733709e+03 -6.64351154e+05
  -6.72106249e+03]
 [ 7.80970356e+05 -2.53962285e+05 -5.81188483e+05  3.99071798e+04
   3.89924275e+04 -4.20984304e+06 -1.55270428e+06  1.61846616e+07
   2.88659739e+06  2.58082421e+04 -8.14673718e+05  2.51575151e+04
   2.41641477e+04 -8.83525354e+03  2.82295531e+04  1.41332357e+07
   3.94796818e+04]
 [ 7.00072872e+05  2.44347147e+05 -4.09970592e+04  7.18670561e+03
   7.19990357e+03 -3.66458224e+05 -1.02391862e+05  2.88659739e+06
   1.20274303e+06  2.31703134e+04 -1.48083768e+05  5.89503475e+03
   6.04729974e+03 -1.57420591e+03  3.70143138e+03  2.87330848e+06
   8.00536018e+03]
 [ 8.47037526e+04  4.59428079e+04  1.72911997e+04  3.46177405e+02
   3.77759266e+02  9.25357647e+04  2.04104467e+04  2.58082421e+04
   2.31703134e+04  2.72597799e+04  2.00430257e+04  7.25342415e+01
   2.42963918e+02 -2.08672067e+01 -8.22631321e+01  9.69125803e+04
   3.00883652e+00]
 [ 4.68346833e+05  3.33556631e+05  1.76737970e+05 -1.11455119e+03
  -1.08360506e+03  1.04170909e+06  3.29732427e+05 -8.14673718e+05
  -1.48083768e+05  2.00430257e+04  4.58425753e+05 -1.20898783e+02
  -3.05154186e+02  3.65415770e+02 -2.39931082e+03 -3.46097802e+05
  -3.13261494e+03]
 [ 2.46894337e+04  1.42382015e+04  5.02896117e+03  1.53184870e+02
   1.76518449e+02  2.52117842e+04  3.70675622e+03  2.51575151e+04
   5.89503475e+03  7.25342415e+01 -1.20898783e+02  2.66608636e+02
   2.04231332e+02 -8.43649246e+00  5.03832295e+01  3.68980582e+04
   8.55571090e+01]
 [ 2.10530676e+04  1.21820938e+04  4.21708603e+03  1.27551581e+02
   1.53002612e+02  2.14242417e+04  3.18059661e+03  2.41641477e+04
   6.04729974e+03  2.42963918e+02 -3.05154186e+02  2.04231332e+02
   2.16747841e+02 -9.33025564e+00  4.87343271e+01  3.37334569e+04
   7.32203957e+01]
 [ 1.46506058e+03  1.70983819e+03  8.72684773e+02 -2.68745252e+01
  -2.30971994e+01  5.37020858e+03  1.40130256e+03 -8.83525354e+03
  -1.57420591e+03 -2.08672067e+01  3.65415770e+02 -8.43649246e+00
  -9.33025564e+00  1.56685279e+01 -1.97641094e+01 -1.20675646e+04
  -2.08548884e+01]
 [-4.32712238e+03 -4.85948702e+03 -2.08169379e+03  9.95672077e+01
   1.02550946e+02 -1.37919297e+04 -5.29733709e+03  2.82295531e+04
   3.70143138e+03 -8.22631321e+01 -2.39931082e+03  5.03832295e+01
   4.87343271e+01 -1.97641094e+01  1.53556744e+02  2.70289215e+04
   1.04493815e+02]
 [ 5.24617110e+06  1.59627169e+06  3.11345431e+05  6.08793102e+04
   5.45464833e+04  4.72403958e+05 -6.64351154e+05  1.41332357e+07
   2.87330848e+06  9.69125803e+04 -3.46097802e+05  3.68980582e+04
   3.37334569e+04 -1.20675646e+04  2.70289215e+04  2.72668656e+07
   3.50129683e+04]
 [ 9.75642164e+03  2.83416292e+03 -3.56587977e+02  1.49992164e+02
   1.62371398e+02 -6.56330753e+03 -6.72106249e+03  3.94796818e+04
   8.00536018e+03  3.00883652e+00 -3.13261494e+03  8.55571090e+01
   7.32203957e+01 -2.08548884e+01  1.04493815e+02  3.50129683e+04
   2.95073717e+02]]
In [53]:
# Step 2 - Get eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)
Eigen Vectors 
 [[ 5.57026265e-01  3.93606986e-02 -1.67353250e-01 -6.64271177e-01
  -1.64686863e-01 -5.80500092e-02 -1.34342602e-01  4.11793520e-01
  -2.74068846e-02 -3.76034135e-03 -4.38857450e-03  3.02180551e-03
   5.30739374e-04  4.00796357e-04 -1.86846892e-04  5.88553549e-04
  -1.12791915e-03]
 [ 3.47711968e-01  7.71620231e-02 -1.62363494e-01 -2.32927268e-01
  -5.80585150e-03 -6.02807211e-02  2.39957336e-01 -8.41806975e-01
   1.49989709e-01  5.82362196e-03  8.23337143e-03 -5.36725416e-03
  -1.75963534e-03  1.07556459e-03  2.32571906e-04 -1.25067011e-03
   2.47359512e-03]
 [ 1.29854039e-01  4.54128642e-02 -9.66343352e-03  5.88323506e-02
   6.40769269e-02 -2.13069540e-02  4.08911751e-02 -1.18368499e-01
  -9.78309623e-01 -7.72030365e-03 -7.08566943e-03  6.92580554e-03
  -1.31713816e-03 -3.49723514e-03  6.24356491e-05  2.61866683e-03
  -3.10201851e-03]
 [ 1.02538882e-03 -1.70554150e-03 -1.31447376e-04 -1.22540877e-04
   1.79736891e-03 -8.03697029e-04 -1.89841825e-03  7.49048980e-03
  -8.51154312e-03  4.20429494e-03  4.36847361e-01 -2.62732520e-01
  -2.88168085e-01 -5.25768664e-02  1.40370359e-02 -3.35924704e-01
   7.35615855e-01]
 [ 1.17742114e-03 -1.49703893e-03 -7.73447203e-04  1.81719721e-04
   1.90983779e-03 -5.52310372e-04 -1.25096055e-03  6.42772177e-03
  -4.48070839e-03  5.65981154e-03  6.22041358e-01 -3.42497500e-01
  -3.83309123e-01 -4.93531025e-02  1.25642981e-03  2.78173333e-01
  -5.18568428e-01]
 [ 6.70614019e-01  2.83671807e-01  2.46719924e-02  5.84959415e-01
   2.81437097e-01  8.53166879e-02 -1.33217513e-02  1.48603786e-01
   1.35160400e-01 -6.11876731e-04 -1.95607954e-03 -1.11699713e-03
   7.12926324e-04  3.45454736e-04 -1.73143292e-04 -3.51935301e-04
   1.59404542e-04]
 [ 1.11112714e-01  8.03795425e-02  6.61418696e-02  3.02818447e-01
  -9.23535263e-01 -1.46195111e-01  1.02603554e-01 -1.82845389e-04
  -1.70019941e-02  1.23090699e-03  2.05228258e-03 -2.48548467e-03
   5.72220327e-04 -1.95409452e-05  1.16158257e-05 -6.84689767e-05
   3.38239528e-04]
 [ 5.48419377e-02 -5.69322786e-01 -7.58609804e-01  2.53930997e-01
   7.01853015e-03 -1.68341354e-01 -5.02771148e-02  4.42263806e-02
   3.08991334e-04  2.39368446e-03 -3.66810501e-03  9.81827475e-04
  -1.11423703e-03 -8.57523167e-04  1.75384734e-04  3.93180296e-05
  -2.35141817e-04]
 [ 2.88655215e-02 -1.05991157e-01 -1.36600904e-01 -1.04560509e-02
  -1.78167886e-01  9.63409521e-01 -6.47932773e-02 -5.99821016e-02
  -2.82846916e-02 -2.56179003e-02 -1.26355478e-04 -1.59014072e-03
  -1.87398767e-03  1.73566568e-03  7.98890965e-05 -6.49515861e-04
   3.73882054e-05]
 [ 3.73421983e-03 -1.42906360e-03  2.73952993e-03  8.81926126e-04
  -6.59073333e-03  2.12473648e-02 -4.97284497e-02 -9.80514287e-03
  -9.17820347e-03  9.98363816e-01 -2.38670832e-03  4.61899122e-03
   6.34107591e-03  1.22800756e-03 -3.48333274e-04 -4.78073439e-03
  -2.34688083e-03]
 [ 2.31322104e-02  2.98378541e-02  6.03049023e-02  4.97334523e-02
  -6.63037318e-02 -8.30838667e-02 -9.49485314e-01 -2.77906906e-01
  -1.38843733e-03 -4.89597665e-02  2.14099624e-03 -1.24586263e-03
   1.97480453e-03  1.21637335e-03  3.71750371e-04  3.68841978e-04
   7.59201476e-05]
 [ 1.13882279e-03 -8.72716975e-04 -6.27829711e-04  1.06424595e-03
  -2.14768347e-04  7.38884188e-04  3.39135432e-04  6.43268494e-04
   2.35431748e-04 -6.06832667e-03  4.50239773e-01  5.71238631e-01
   1.68316946e-01 -7.71427378e-02 -3.50503648e-02 -5.99392247e-01
  -2.75948904e-01]
 [ 9.89571300e-04 -8.33784372e-04 -6.77232419e-04  1.12596356e-03
  -2.87754543e-04  1.47391910e-03  4.79948557e-04  4.20588255e-04
   1.56870869e-03  1.42525099e-03  3.68973207e-01  5.32313022e-01
   1.48251140e-01  3.40800319e-02  6.30153316e-03  6.66714234e-01
   3.35862526e-01]
 [ 2.85990622e-05  4.27502771e-04 -6.90825899e-05 -2.50575626e-05
  -1.11262887e-05  8.75295859e-05  2.48339643e-04  2.65870573e-04
   3.19680045e-04  1.25135438e-04  2.88120655e-03  2.34890666e-02
   5.39624725e-03 -2.92177694e-02  9.98878341e-01 -1.89206496e-02
  -2.09949394e-02]
 [-1.09332895e-04 -1.10753753e-03 -8.64619902e-04  5.03048239e-04
   1.27397693e-03 -1.53600043e-03  1.48915448e-03  2.44339493e-03
  -5.89963085e-03 -1.79550779e-03  1.47788276e-01 -1.24200540e-01
   2.00906626e-01  9.57409300e-01  2.77750535e-02 -6.80748442e-02
  -1.66953671e-02]
 [ 2.92390370e-01 -7.53152763e-01  5.85452368e-01  9.09567146e-03
   1.92423154e-02 -5.62238284e-03  3.55651865e-02 -5.19314381e-02
   7.79228934e-03 -2.21151059e-03 -1.27135906e-03 -2.30726926e-04
   5.39009988e-04 -1.77416158e-04  3.54158849e-04  2.03171958e-04
  -4.20330536e-04]
 [ 3.20352937e-04 -1.36502084e-03 -2.15064497e-03 -1.00480407e-03
   1.69698386e-03  1.32453885e-03  1.25776311e-03  3.12266621e-03
  -5.27982343e-03 -2.09321945e-03  2.47692379e-01 -4.33473280e-01
   8.24182373e-01 -2.64916617e-01 -1.79094197e-03  3.11565050e-02
   1.61938738e-02]]

 Eigen Values 
[4.30379378e+07 3.78056325e+07 6.24492891e+06 2.91852679e+06
 1.44214814e+06 6.21395725e+05 3.72541141e+05 3.27117832e+05
 3.92815684e+04 2.53168644e+04 4.26204091e+02 2.09555511e+02
 1.57391024e+02 8.08957234e+01 8.43828081e+00 3.59829140e+01
 2.69844159e+01]

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors; use values with two places of decimals only).
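One way to answer this question is to sort the eigenpairs by eigenvalue and print the leading eigenvector with its coefficients rounded to two decimals. A minimal sketch follows; the 3×3 matrix and the feature names are placeholders (the notebook's actual 17×17 `cov_matrix` and column names are not reproduced here):

```python
import numpy as np

# Placeholder symmetric matrix standing in for the notebook's cov_matrix
cov_matrix = np.array([[4.0, 2.0, 0.6],
                       [2.0, 3.0, 0.5],
                       [0.6, 0.5, 1.0]])

eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

# np.linalg.eig does not sort, so order eigenpairs by descending eigenvalue;
# the column for the largest eigenvalue is the loading vector of PC1
order = np.argsort(eig_vals)[::-1]
pc1 = eig_vecs[:, order[0]]

# Explicit form of PC1 with coefficients rounded to two decimals
feature_names = ['X1', 'X2', 'X3']  # hypothetical feature names
terms = [f"{w:+.2f}*{name}" for w, name in zip(np.round(pc1, 2), feature_names)]
print("PC1 =", " ".join(terms))
```

Note that eigenvector signs are arbitrary, so an equally valid PC1 has all coefficients negated.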

In [56]:
plt.plot(var_exp)
Out[56]:
[<matplotlib.lines.Line2D at 0x253279ac0a0>]

Visually, we can observe that there is a steep drop in the variance explained as the number of PCs increases. In the above scree plot:

• 46.36% of the total variation is explained by the first Principal Component, which the scree plot confirms. • In the scree plot, the last big drop occurs between the first and second components, so we choose the first component.

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?

In [55]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 46.35921748  87.0823476   93.80920378  96.95295613  98.50639625
  99.17574569  99.57703619  99.92939803  99.971711    99.99898159
  99.99944068  99.99966641  99.99983595  99.99992308  99.99996184
  99.99999091 100.        ]
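The cumulative values above support a simple selection rule: keep the smallest number of components whose cumulative explained variance crosses a chosen threshold. A sketch using the eigenvalues printed earlier, with 90% as an assumed cutoff (the choice of threshold is a judgment call, not fixed by the notebook):

```python
import numpy as np

# Eigenvalues as printed by np.linalg.eig above (unsorted, hence the sort)
eig_vals = np.array([4.30379378e+07, 3.78056325e+07, 6.24492891e+06,
                     2.91852679e+06, 1.44214814e+06, 6.21395725e+05,
                     3.72541141e+05, 3.27117832e+05, 3.92815684e+04,
                     2.53168644e+04, 4.26204091e+02, 2.09555511e+02,
                     1.57391024e+02, 8.08957234e+01, 8.43828081e+00,
                     3.59829140e+01, 2.69844159e+01])

var_exp = np.sort(eig_vals)[::-1] / eig_vals.sum() * 100
cum_var_exp = np.cumsum(var_exp)

# Smallest k whose components explain at least the (assumed) 90% cutoff
k = int(np.argmax(cum_var_exp >= 90) + 1)
print(k)                              # → 3
print(np.round(cum_var_exp[:k], 2))   # → [46.36 87.08 93.81]
```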
In [57]:
var_exp
Out[57]:
[46.359217475968514,
 40.72313012469262,
 6.726856181551836,
 3.1437523507326897,
 1.5534401168849268,
 0.6693494401194958,
 0.4012905048773123,
 0.35236183472146254,
 0.042312965452112654,
 0.027270591587771828,
 0.00045909467716895785,
 0.00022572711452818666,
 0.0001695370426492777,
 8.713852539458325e-05,
 3.875975066593411e-05,
 2.9066829652540624e-05,
 9.089471182084053e-06]

2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

In this data set there are 3 different types of values: 1) Numbers (4 to 5 digits), 2) Ratios (2 digits with decimals), 3) Percentages (2 digits). Since the data in these variables are on different scales, it is difficult to compare them.

Number: Apps, Accept, Enroll, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal. Percentage: Top10perc, Top25perc, PhD, Terminal, perc.alumni. Ratio: Grad.Rate.

Feature scaling (also known as data normalization) is the method used to standardize the range of the features in the data. Since the ranges of the values may vary widely, it becomes a necessary data-preprocessing step when using machine learning algorithms.

In this method, we convert variables measured on different scales to a single common scale.

StandardScaler standardizes the data using the formula (x - mean)/standard deviation.

When it comes to choosing between covariance and correlation, the latter is usually the first choice, as it remains unaffected by changes in dimension, location, and scale, and can also be used to compare pairs of variables. Since it is limited to the range -1 to +1, it is useful for drawing comparisons between variables across domains. An important limitation, however, is that both concepts measure only the linear relationship.

Applying zscore or using StandardScaler gives the same results.

It scales the data in such a way that the mean of each feature tends to 0 and its standard deviation tends to 1.

The Min-Max method instead ensures that the data is scaled to values in the range 0 to 1.
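These scaling formulas can be verified by hand on a toy matrix (hypothetical values, not the college data). Both StandardScaler and scipy's zscore use the population standard deviation (ddof=0), which is why the results match:

```python
import numpy as np

# Toy 3x2 matrix standing in for the numeric college variables
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Standard scaling by hand: (x - mean) / std per column (ddof=0, matching
# both sklearn's StandardScaler and scipy.stats.zscore)
std_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.round(std_scaled.mean(axis=0), 10))  # → [0. 0.]
print(np.round(std_scaled.std(axis=0), 10))   # → [1. 1.]

# Min-Max scaling by hand: maps each column into [0, 1]
mm_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(mm_scaled.min(), mm_scaled.max())       # → 0.0 1.0
```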

